dCDH by_path R-parity fixtures + TestDCDHDynRParityByPath#360
Conversation
|
Overall Assessment ⛔ Blocker Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
ca4df9d to
b40ca7b
Compare
|
Rebased onto current No TROP / /ai-review |
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
… version Three findings from CI review round 2 (1 P3 + 2 P2): - P2: `POINT_RTOL` tightened from 0.025 (MIXED_POINT_RTOL) to 1e-9. Observed rtol on all 15 (path, horizon) cells is ~1e-11 across both scenarios, consistent with R's 10-digit JSON rounding. The prior 2.5% tolerance would silently accept ~6 orders of magnitude of point-estimate regression while the docs claim exact R parity. - P2: `_compare_by_path()` now asserts matching finite/positive state on `py_se` vs `r_se` BEFORE the `pytest.approx` numeric check. Previously the `if py_se > 0 and r_se > 0:` guard silently skipped SE parity when one side degraded to 0/NaN/inf while the other stayed finite; the regression now fails explicitly with a diagnostic citing both values and suggesting a variance-identifiability check. - P3: `DIDmultiplegtDYN` version pin moved from `>= 2.3.3` to `== 2.3.3` so a future release with changed `by_path` slot semantics cannot silently regenerate the fixture. Comment documents the update protocol: bump the pin AND re-run the parity class when moving to a newer known-compatible version, or extend to an explicit allowlist once a second version is verified. Tests still pass (2/2 in TestDCDHDynRParityByPath); ruff clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good No unmitigated P0/P1 findings. Re-review scope here is the Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
Wave 1 of the by_path follow-up sequence. Locks in the per-path SE
convention against R DIDmultiplegtDYN 2.3.3 before downstream waves
(inference completion, design variants, survey) build on it.
Extends benchmarks/R/generate_dcdh_dynr_test_values.R with an
extract_dcdh_by_path helper and two scenarios:
- mixed_single_switch_by_path: 2 observed paths, by_path=2
- multi_path_reversible_by_path: 4 observed paths, by_path=3
The multi_path_reversible DGP is deterministic: path assignment is a
function of F_g so each (D_{g,1}, F_g, S_g) cohort contains switchers
from a single path. This keeps the cohort-recentered IF comparable to
R's per-path re-run; cross-path cohort sharing is the documented SE
divergence mechanism.
New TestDCDHDynRParityByPath in tests/test_chaisemartin_dhaultfoeuille_parity.py
matches paths by tuple label via set-equality (robust to R's
undocumented frequency-tie tiebreak) and hard-asserts per-path
switcher counts before SE comparison. Observed parity: point
estimates and switcher counts match R exactly; per-path SE within
12% rtol (Phase 2 envelope) — scenario 13 max 10.2%, scenario 14
max 4.2%.
REGISTRY.md Note (Phase 3 by_path...) updated with R-parity
confirmation and a **Deviation from R (cross-path cohort-sharing
SE)** bullet describing the mechanism under which full-panel
cohort-centered plug-in (ours) and per-path re-run (R's) diverge.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… version Three findings from CI review round 2 (1 P3 + 2 P2): - P2: `POINT_RTOL` tightened from 0.025 (MIXED_POINT_RTOL) to 1e-9. Observed rtol on all 15 (path, horizon) cells is ~1e-11 across both scenarios, consistent with R's 10-digit JSON rounding. The prior 2.5% tolerance would silently accept ~6 orders of magnitude of point-estimate regression while the docs claim exact R parity. - P2: `_compare_by_path()` now asserts matching finite/positive state on `py_se` vs `r_se` BEFORE the `pytest.approx` numeric check. Previously the `if py_se > 0 and r_se > 0:` guard silently skipped SE parity when one side degraded to 0/NaN/inf while the other stayed finite; the regression now fails explicitly with a diagnostic citing both values and suggesting a variance-identifiability check. - P3: `DIDmultiplegtDYN` version pin moved from `>= 2.3.3` to `== 2.3.3` so a future release with changed `by_path` slot semantics cannot silently regenerate the fixture. Comment documents the update protocol: bump the pin AND re-run the parity class when moving to a newer known-compatible version, or extend to an explicit allowlist once a second version is verified. Tests still pass (2/2 in TestDCDHDynRParityByPath); ruff clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…4 doc cleanup Two findings (1 P2 + 1 P3): - P2: `_compare_by_path` now asserts `py_path["frequency_rank"] == r_path_entry["frequency_rank"]` for every committed path. Both scenarios are constructed with unique path frequencies (scenario 13 via the mixed_single_switch pattern, scenario 14 via deterministic counts 40/25/10/5), so rank ordering is unambiguous and any regression in top-k tiebreak handling now fails explicitly instead of passing silently as long as the selected path set and per-path effects remain correct. - P3: scenario 14 generator docstring and recorded params still described the old stochastic `p_switch`-driven DGP (the pre-PR variant that blew out SE parity via cross-path cohort mixing). The `multi_path_reversible` pattern is now DETERMINISTIC: path assignment is a fixed function of F_g with counts 20/20/15/10/10/5 across the 6 F_g values. `p_switch = 0.35` dropped from both the scenario call and the `params` block in the fixture; comment block rewritten to describe the deterministic design and cite the REGISTRY note for the rationale behind the design choice. Fixture regenerated; scenario 14 params no longer carry the stale `p_switch` entry. Point and SE parity numbers unchanged (deterministic DGP produces the same treatment matrix as before). Tests pass (2/2); ruff clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
23259ed to
daefda7
Compare
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good No unmitigated P0/P1 findings. Re-review scope here is the new Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
When `n_bootstrap > 0` is set with `by_path=k`, per-path joint sup-t simultaneous confidence bands are now computed across horizons `1..L_max` within each path. A single shared `(n_bootstrap, n_eligible)` multiplier weight matrix (using the estimator's configured `bootstrap_weights` — Rademacher / Mammen / Webb) is drawn per path and broadcast across all valid horizons, producing correlated bootstrap distributions. The path-specific critical value `c_p = quantile(max_l |t_l|, 1-α)` is applied per horizon as `cband_conf_int = (eff - c_p·se, eff + c_p·se)` and surfaced at top level as `results.path_sup_t_bands[path]`. Closes Wave 2 #4 of the by_path follow-up sequence (#357 foundation, #360 R-parity, #364 bootstrap, #371 placebos). **Methodology asymmetry vs OVERALL** (intentional, documented): per-path sup-t draws fresh shared weights AFTER the per-path SE bootstrap block has populated `path_ses` via independent per-(path, horizon) draws. Asymptotically equivalent to OVERALL's self-consistent reuse but NOT bit-identical. Preserves RNG-state isolation for existing per-path SE seed-reproducibility tests. **Gates** mirror OVERALL: `>=2` valid horizons (finite bootstrap SE > 0) AND a strict majority (more than 50%) of finite sup-t draws to receive a band. Otherwise the path is absent from `path_sup_t_bands`. **Empty-state contract**: `path_sup_t_bands is None` when not requested (no bootstrap or `by_path is None`); `{}` when requested but no path passes both gates (covers two cases: `path_effects == {}` upstream OR all paths fail gates downstream). **Deviation from R**: `did_multiplegt_dyn` provides no joint / sup-t bands at any surface — Python-only methodology extension consistent with the existing OVERALL `event_study_sup_t_bands` (also Python-only). Inherits the cross-path cohort-sharing SE deviation from R documented for `path_effects`. **Bundled pre-audit fix** (sibling-surface check): the existing OVERALL `sup_t_bands` field's stale "Phase 2 placeholder" docstring updated to the actual contract description. Tests: new `TestByPathSupTBands` class with 13 tests covering: attr None when no bootstrap / no by_path; keys match `path_effects` with finite crit; band wider than pointwise; crit finite and positive; seed reproducibility; single-horizon-path-skip; L_max=1 skip; n_valid_horizons matches; absent-path-no-cband-keys; summary renders; empty-dict-when-no-complete-window; strict-majority-gate-at-exact-50pct (monkeypatches the weight generator to inject NaN into half the bootstrap rows, asserting both `sup_t_bands is None` and `path_sup_t_bands == {}` at the boundary). All `@pytest.mark.slow`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
by_pathfollow-up sequence — locks in the per-path SE convention against RDIDmultiplegtDYN 2.3.3before downstream waves (inference completion, design variants, survey) build on it.benchmarks/data/dcdh_dynr_golden_values.json:mixed_single_switch_by_path(2 paths,by_path=2) andmulti_path_reversible_by_path(4 paths,by_path=3via a new deterministic multi-path DGP pattern in the R generator).TestDCDHDynRParityByPathclass intests/test_chaisemartin_dhaultfoeuille_parity.py— 2 parity tests matching paths by tuple label and cross-checking per-path switcher counts before SE comparison.Observed parity
SE_RTOL = 0.12passes both with margin)Documented deviation
REGISTRY.md
Note (Phase 3 by_path...)block now includes:**Deviation from R (cross-path cohort-sharing SE):**— our full-panel cohort-centered plug-in (joiners/leavers precedent) vs R's per-path re-run diverges materially when a(D_{g,1}, F_g, S_g)cohort spans multiple observed paths; the two coincide when every cohort is single-path. Practitioner guidance: interpret per-path SE as a within-full-panel marginal variance, not a per-path conditional variance.Scenario 14's DGP is constructed to keep cohorts single-path by making path assignment a deterministic function of
F_g, demonstrating the regime where Python and R agree up to envelope.R script additions
extract_dcdh_by_pathhelper readsres$by_levels+res$by_level_1..by_level_k, capturing per-path effects / SE / CI / switcher counts per horizon."multi_path_reversible"pattern ingen_reversiblewith cohort-clean path assignment.stopifnot(packageVersion("DIDmultiplegtDYN") >= "2.3.3")at script top —by_pathoutput slots are not version-stable per R package docs.Wave queue updated
~/.claude/projects/.../memory/project_dcdh_by_path_next_prs.mdmarks Wave 1 complete with the SE-gap observation. Next up: Wave 2 (inference completion —n_bootstrap > 0, per-path placebos, per-pathsup_t_bands).Test plan
pytest tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityByPath -vpytest tests/test_chaisemartin_dhaultfoeuille_parity.py -v(17 passed locally)pytest tests/test_chaisemartin_dhaultfoeuille.py tests/test_methodology_chaisemartin_dhaultfoeuille.py(189 + 10 passed locally)to_dataframe(level="by_path")round-trip on R-derived data returns 6 rows × 11 columns as expectedRscript benchmarks/R/generate_dcdh_dynr_test_values.R(requires R + DIDmultiplegtDYN 2.3.3 + jsonlite)🤖 Generated with Claude Code